Joint environment and speaker normalization using factored front-end CMLLR
Authors
Abstract
The problem of joint compensation of environment and speaker variabilities is addressed. A factored feature-space transform, named factored front-end CMLLR (F-FE-CMLLR), is investigated, which comprises a cascade of two transforms: front-end CMLLR for environment normalization and CMLLR for speaker normalization. In this paper, we propose an iterative estimation algorithm for F-FE-CMLLR. We believe that the iterative estimation helps to decouple the effects of the two acoustic factors, allowing each transform to learn the effect of only one factor, thereby yielding an improvement in speech recognition performance compared to sequential estimation. However, estimating the environment transform yields full-covariance Gaussians in the GMM-HMM, which makes direct estimation computationally expensive. An efficient training algorithm is presented that reduces the computational cost considerably. Further, it is shown that a row-by-row optimization procedure can be employed, which makes the algorithm more efficient and attractive. On the multi-condition Aurora 4 task with a discriminatively trained GMM-HMM, it is shown that F-FE-CMLLR yields 11.6% and 8.7% relative improvements on two evaluation sets over the baseline features that are processed only by CMLLR for speaker normalization.
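The cascade structure described in the abstract can be illustrated with a minimal sketch. It assumes, as a simplification that the abstract does not state, that each of the two transforms is a single affine map; the names `apply_affine`, `apply_factored`, and `compose`, as well as the numeric values, are hypothetical and chosen only for illustration. The sketch covers how the two factors are applied, not how they are estimated.

```python
def apply_affine(A, b, x):
    """Return A x + b for one feature vector x (plain lists of floats)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]


def apply_factored(env, spk, x):
    """Apply the environment transform first, then the speaker transform."""
    A_e, b_e = env
    A_s, b_s = spk
    return apply_affine(A_s, b_s, apply_affine(A_e, b_e, x))


def compose(env, spk):
    """Collapse the cascade into one affine map: A = A_s A_e, b = A_s b_e + b_s."""
    A_e, b_e = env
    A_s, b_s = spk
    n = len(b_e)
    A = [[sum(A_s[i][k] * A_e[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    b = apply_affine(A_s, b_s, b_e)
    return A, b


# Hypothetical 2-dimensional transforms (assumed values, not from the paper).
env = ([[1.0, 0.5], [0.0, 2.0]], [1.0, -1.0])  # environment factor
spk = ([[2.0, 0.0], [1.0, 1.0]], [0.5, 0.0])   # speaker factor
x = [1.0, 2.0]                                 # one feature frame

y = apply_factored(env, spk, x)
A, b = compose(env, spk)
```

Because the composition of two affine maps is again affine, a given (environment, speaker) pair behaves like a single transform at application time. Keeping the factors separate is what would allow an environment transform to be reused across speakers and a speaker transform across environments.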
Similar resources
Separating Speaker and Environmental Variability Using Factored Transforms
Two primary sources of variability that degrade accuracy in speech recognition systems are the speaker and the environment. While many algorithms for speaker or environment adaptation have been proposed to improve performance, far less attention has been paid to approaches which address both factors. In this paper, we present a method for compensating for speaker and environmental mismatch ...
Model-Based Approaches for Degraded Channel Modelling in Robust ASR
Speech is usually observed after passing through some form of “channel” that results in distortions. For some scenarios it is possible to build explicit models of this channel distortion and hence compensate the acoustic models. However the accuracy of the distortion model is sometimes poor and more general adaptation approaches are required. This paper investigates these model-based approaches...
Adaptive Training Using Simple Target Models
Adaptive training aims at reducing the influence of speaker, channel and environment variability on the acoustic models. We describe an acoustic normalization approach to adaptive training. Phonetically irrelevant acoustic variability is reduced at the beginning of the training procedure w. r. t. a set of target models. The set of target models can be a set of HMMs or a Gaussian mixture model (...
MLLR techniques for speaker recognition
Maximum-Likelihood Linear Regression (MLLR) and Constrained MLLR (CMLLR) have been recently used for feature extraction in speaker recognition. These systems use (C)MLLR transforms as features that are modeled with Support Vector Machines (SVM). This paper evaluates and compares several of these approaches for the NIST Speaker Recognition task. Single CMLLR and up to 4-phonetic-class MLLR trans...
Feature Level Compensation for Robust Speaker Identification in Mismatched Conditions
In this paper, robust front-end features are proposed to improve speaker identification (SI) performance by considering real-world factors such as mismatch between training and testing conditions. The most commonly used MFCC features are very sensitive to effects such as channel and environment mismatch. Characteristics of speech get changed with room acoustics, ch...
Publication date: 2015